BioJulia系列-SequenceVariation包

Julia中变异和单倍型的表示 文档原文

  • 基于两两比对的结果, 输出变异和单倍型信息;

  • 单倍型和变异类型, 包含ref信息, 可以互相比较;

  • 可以根据比对结果, 切换参考基因组;

  • 只提供了PAV(INS和DEL)和SNP(Substitutions)三种基本类型的变异

单倍型

`Haplotype`类型 PASS

从比对结果中提取变异

SequenceVariant可以使用Haplotype类型的构造函数直接获得变异信息。

julia

julia> using SequenceVariation, BioAlignments, BioSequences
julia> bovine = dna"GACCGGCTGCATTCGAGGCTGCCAGCAAGCAG";
julia> ovine  = dna"GACCGGCTGCATTCGAGGCTGTCAGCAAACAG";
julia> human  = dna"GACAGGCTGCATCAGAAGAGGCCATCAAGCAG";
julia> bos_ovis_alignment =
           PairwiseAlignment(AlignedSequence(ovine, Alignment("32M", 1, 1)), bovine);
julia> bos_human_alignment =
           PairwiseAlignment(AlignedSequence(human, Alignment("32M", 1, 1)), bovine);
julia> bos_ovis_haplotype = Haplotype(bos_ovis_alignment)
Haplotype{BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}, BioSymbols.DNA} with 2 edits:
  C22T
  G29A
julia> bos_human_haplotype = Haplotype(bos_human_alignment)
Haplotype{BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}, BioSymbols.DNA} with 7 edits:
  C4A
  T13C
  C14A
  G17A
  C19A
  T20G
  G25T

julia

根据单倍型重建query序列

Haplotype类型存储了参考序列信息, 可以利用reconstruct函数重建query序列:

julia

julia> human2 = reconstruct(bos_human_haplotype)
32nt DNA Sequence:
GACAGGCTGCATCAGAAGAGGCCATCAAGCAG
julia> human2 == bovine
false
julia> human2 == human
true

julia

切换REF

提供 RawREF => NewREF的比对结果, 就可以用translate函数把变异切换到NewREF:

julia

julia> ovis_human_alignment = PairwiseAlignment(AlignedSequence(human, Alignment("32M", 1, 1)), ovine)
BioAlignments.PairwiseAlignment{BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}, BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}}:
  seq:  1 GACAGGCTGCATCAGAAGAGGCCATCAAGCAG 32
          ||| ||||||||  || |  | || ||| |||
  ref:  1 GACCGGCTGCATTCGAGGCTGTCAGCAAACAG 32

julia> SequenceVariation.translate(bos_ovis_haplotype, ovis_human_alignment)
Haplotype{BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}, BioSymbols.DNA} with 2 edits:
  C22T
  G29A

julia

变异

`Variation`类型 PASS

构建

利用Variation构造函数:

julia

using BioSequences, SequenceVariation
bovine_ins = dna"GACCGGCTGCATTCGAGGCTGCCAGCAAGCAG"
snp = Variation(bovine_ins, "C4A")
typeof(mutation(snp)) # Substitution{DNA}
del = Variation(bovine_ins, "Δ13-14")
typeof(mutation(del)) # Deletion
ins = Variation(bovine_ins, "25ACA")
typeof(mutation(ins)) # Insertion{LongSequence{DNAAlphabet{4}}}

julia

提取

利用variations()函数从Haplotype中提取:

julia

variations(bos_ovis_haplotype)

julia

转换REF

translate函数也可以对Variation进行切换:

julia

julia> ovis_human_alignment = PairwiseAlignment(AlignedSequence(human, Alignment("32M", 1, 1)), ovine)
BioAlignments.PairwiseAlignment{BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}, BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}}:
  seq:  1 GACAGGCTGCATCAGAAGAGGCCATCAAGCAG 32
          ||| ||||||||  || |  | || ||| |||
  ref:  1 GACCGGCTGCATTCGAGGCTGTCAGCAAACAG 32

julia> human_variation = first(variations(bos_ovis_haplotype))
C22T

julia> reference(ans) == bovine
true

julia> SequenceVariation.translate(human_variation, ovis_human_alignment)
C22T

julia> reference(ans) == bovine
false

julia

变异的比较

  1. in函数用于在Haplotype中查找Variation

julia

println("Variation\tOvis aires\tHomo sapiens")
for v in vcat(variations(bos_ovis_haplotype), variations(bos_human_haplotype))
    is_sheep = v in bos_ovis_haplotype
    is_human = v in bos_human_haplotype
    println("$v\t$is_sheep\t\t$is_human")
end

julia
VariationOvis airesHomo sapiens
C22Ttruefalse
G29Atruefalse
C4Afalsetrue
T13Cfalsetrue
C14Afalsetrue
G17Afalsetrue
C19Afalsetrue
T20Gfalsetrue
G25Tfalsetrue
  1. 用变异构建新的单倍型

julia

julia> sheeple = vcat(variations(bos_ovis_haplotype), variations(bos_human_haplotype));
julia> Haplotype(bovine, sheeple)
Haplotype{BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}, BioSymbols.DNA} with 9 edits:
  C4A
  T13C
  C14A
  G17A
  C19A
  T20G
  C22T
  G25T
  G29A

julia> reconstruct!(bovine, ans)

julia

API

Edits

  • Edits类型

  • Substitution类型

  • Deletion类型

  • Insertion类型

变异发生的位置不存储在以上数据结构中。

Variants

  • Haplotype类型

Haplotype{S<:BioSequence,T<:BioSymbol} 记录了一系列变异的集合 构造函数:
julia

# 根据edits类型
Haplotype(ref::S, edits::Vector{Edit{S,T}}) where {S<:BioSequence,T<:BioSymbol}
# 根据Variation
Haplotype(ref::S, vars::Vector{Variation{S,T}}) where {S<:BioSequence,T<:BioSymbol}
# 根据序列比对
Haplotype(aln::PairwiseAlignment{T,T}) where {T<:LongSequence{<:Union{BS.AminoAcidAlphabet,BS.NucleicAcidAlphabet}}}

julia

当从EditVariation构造时, 从序列的第一个位置到最后一个位置依次引用, 因此必须始终按照位置对向量排序。

  • reference(h::Haplotype): 提取h的参考序列

  • variations(h::Haplotype{S,T}) where {S, T}: 把h中的edits转变成Variation向量

  • reconstruct(h::"Haplotype): 重构query序列

  • translate(h::Haplotype{S,T}, aln::PairwiseAlignment{S,S}) where {S,T}

根据aln切换ref, 位于开头或结尾的INDEL可能无法保存。

Variations

  • Variation类型:

Variation{S<:BioSequence,T<:BioSymbol} 包含ref和edit, 比Edit类型更稳健。
julia

struct Variation{S<:BioSequence,T<:BioSymbol}
    ref::S
    edit::Edit{S,T}

    function Variation{S,T}(
        ref::S, e::Edit{S,T}, ::Unsafe
    ) where {S<:BioSequence,T<:BioSymbol}
        return new(ref, e)
    end
end

julia
构造函数:
julia

Variation(ref::S, e::Edit{S,T}) where {S<:BioSequence,T<:BioSymbol}
Variation(ref::S, edit::AbstractString) where {S<:BioSequence}

julia
鼓励使用variations(h)的方式构造。
  • reference(v)

  • mutation(v): 获得v对应的edits

  • translate(v, aln)

  • refbases(v): 获得v的REF序列

  • altbases(v): 获得v的ALT序列, 对于INS, 会根据VCFv4 Spec, 包含插入前的碱基: 25ACA: G => GACA

思考

  • 传统的SV包含INS,DEL,TRA,INV,DUP,BND等类型, 如何用三种基本的Edits类型表示?

  • 串联重复序列呢?